Abstract

When the air traffic demand is expected to
exceed the available airport’s capacity for a short
period of time, Ground Stop (GS) operations are
implemented by Federal Aviation Administration
(FAA) Traffic Flow Management (TFM). The GS
requires departing aircraft meeting specific criteria to
remain on the ground to achieve reduced demand at
the constrained destination airport until the end of the
GS. This paper provides a high-level overview of the
statistical distributions as well as causal factors for
the GSs at the major airports in the United States.
The GS characteristics, the weather impact on GSs, GS
variations with delays, and the interaction between
GSs and Ground Delay Programs (GDPs) at Newark
Liberty International Airport (EWR) are investigated.
Machine learning methods are used to generate
classification models that map the historical airport
weather forecast, scheduled traffic, and other airport
conditions to implemented GS/GDP operations, and
the models are evaluated using cross-validation.
This modeling approach produced promising results:
it yielded an 85% overall classification accuracy in
distinguishing the implemented GS days from the
normal days without GS and GDP operations and a
71% accuracy in differentiating the GS and GDP
implemented days from the GDP only days.

1. Introduction 

Air traffic congestion at the major commercial
airports has been a serious problem in the National
Airspace System (NAS), especially during inclement
weather. FAA’s TFM manages air traffic flow to
balance the air traffic arrival demand against airport
capacity when adverse weather or other
circumstances reduce the latter. At the
airports in the United States, when the air traffic
demand is estimated to exceed the airport’s capacity
for a short period of time, a GS, one of the tactical TFM
actions, may be enacted by FAA air traffic control.

A GS is a procedure requiring aircraft that meet
specific criteria to remain on the ground at their
origin airports, to ensure that aircraft destined for the
affected airport are not released until the operational
situation allows [1]. Normally GSs are reactive to the
current situation when traffic control is unable to
safely accommodate additional aircraft in the system.
They are most frequently used to preclude extended
periods of airborne holding or to prevent the airports
from reaching gridlock. GSs are considered to be one
of the most restrictive Traffic Management Initiatives
(TMIs) and they override all other TMIs that are used
to manage air traffic flows in the National Airspace
System (NAS).

When the projected arrival traffic demand
exceeds the airport capabilities for a long period of
time, GDPs are implemented by TFM as strategic
actions. A GDP is a procedure requesting delays of
some flights at their departure airport in order to
reconcile demand with capacity at their arrival
airport. GDPs are usually a result of adverse weather
conditions. Unlike a GS, a GDP is more sophisticated
and user-friendly; TFM issues not only GDP
parameters, such as the GDP start time and GDP
duration, but also an Expected Departure Clearance Time
(EDCT) assigned to each affected flight. Therefore
the airlines know the amount of delay for each
aircraft and can manage its EDCT in their best
interests. Without EDCT information for the aircraft
during GS operations, it is very hard for any
airline to determine the departure times for GS
affected flights. Furthermore, if a GS needs to last
longer than projected because of
inaccurate demand predictions and forecasts, TFM
may extend the GS duration, use multiple GSs, or
make a TMI transition from a GS into a GDP. These
TMI interactions could make some results less
predictable and desirable [2].

In recent years, a number of weather-induced
TMI studies have emerged in the literature [3-7].
In spite of that, to the best of the author’s
knowledge, there have not been any published studies
seeking to analyze and predict whether a GS
operation is necessary or not. This study provides a
high-level overview of the GS statistical
distributions, causal factors, and the weather impacts
on GSs at the major airports in the United States. The
GS characteristics, GS variation with airport demands
and delays, and the interactions of GSs and GDPs at
Newark Liberty International Airport (EWR) were
investigated. Machine learning classification
algorithms were employed to predict
whether a GS alone or a GS and GDP
combined may be applied to manage arrivals destined
for EWR airport.

The paper makes use of Ensemble Bagging
Decision Tree (BDT) classification to predict GS or
GS/GDP operations during bad weather. The strategy
is to develop predictive BDT models utilizing
historical GS, GDP, and weather forecast training
data, and then to apply these models to test data to
suggest whether a GS or GS/GDP should be planned.
The prediction outlooks are then discussed.

The data mining algorithm and cross-validation
approach are described in Section 2. The National
Traffic Management Log (NTML), the FAA Aviation
System Performance Metrics (ASPM), and Rapid
Update Cycle (RUC) data sources are outlined in
Section 3. The historical analysis of GS operations is
presented in Section 4, while the data mining
predictions are described in Section 5. Finally, a
summary of the results is presented in Section 6.

2. APPROACH AND MODELING 
METHODOLOGY 

The Ensemble Bagging Decision Tree (BDT) model
was used to predict the need for GS
operations on both normal and GDP implemented
days. Supervised machine learning was
applied to training data to generate the BDT
models, and the models were validated by
cross-validation.

Ensemble Bagging Decision Tree 

Ensemble methods combine multiple machine
learning models (here, decision trees) to obtain better
predictive performance than any of the
individual constituent models can produce.
Bagging stands for bootstrap aggregation. Bootstrap
aggregation is a machine learning ensemble
meta-algorithm designed to improve the stability and
accuracy of machine learning algorithms used in
statistical classification and regression [8]. In
classification scenarios, the random resampling
procedure in bagging induces some classification
margin over the dataset. Additionally, when
bagging is performed in different feature
subspaces, the resulting classification margins are
likely to be diverse, which is essential for an
ensemble to be accurate. The method takes into
account the diversity of classification margins
in feature subspaces to enhance the behavior of
bagging. First, it studies the average error rate of
bagging and converts the task into an optimization
problem for determining weights for the feature
subspaces. Then, it assigns the weights to the
subspaces in a randomized fashion during classifier
construction. Experimental results demonstrate
that the ensemble method is robust to
classification noise and often generates better
predictions than any single classifier (see,
for example, [9-10]). In this study, the BDT
classification model is implemented using the
MATLAB TreeBagger function [11].

Several features of bagged decision trees make
TreeBagger a unique algorithm. Drawing, with
replacement, as many samples as there are training
observations is expected to select about 63.2% of
the observations at least once when the training set
is large. The process therefore omits on average 36.8% of
the observations for each decision tree; these are called
"out-of-bag" observations. The out-of-bag observations
can then be used to estimate the feature importance
by randomly permuting out-of-bag data across one
input variable at a time and estimating the increase in
the out-of-bag error due to this permutation. The
larger the error increase, the more important the
feature. Thus, the feature importance can be
obtained in the process of training, which is an
attractive characteristic of TreeBagger.
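
The 63.2% figure can be checked numerically. The following is a minimal Python sketch, not part of the MATLAB TreeBagger workflow used in this study; the function names are hypothetical. It evaluates 1 - (1 - 1/n)^n, which approaches 1 - 1/e as n grows, and compares it with one simulated bootstrap draw.

```python
import random


def expected_unique_fraction(n: int) -> float:
    """Probability that a given observation appears at least once
    when n samples are drawn with replacement from n observations."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def simulated_unique_fraction(n: int, seed: int = 0) -> float:
    """Fraction of distinct observations in one bootstrap sample of size n."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}
    return len(drawn) / n


if __name__ == "__main__":
    n = 10_000
    in_bag = expected_unique_fraction(n)
    print(f"in-bag fraction:     {in_bag:.4f}")
    print(f"out-of-bag fraction: {1.0 - in_bag:.4f}")
    print(f"one simulated draw:  {simulated_unique_fraction(n):.4f}")
```

The out-of-bag share is simply the complement, about 36.8%, which is the portion available for the permutation-based importance estimate described above.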

Model Validation Methods 

The machine learning models are constructed
starting from an initial random state and ending in a
trained state using training data sets, and are
tested or validated using a different data set.
A number of validation approaches are
available. Among them, the very popular
cross-validation approach has been frequently used by
researchers.


In cross-validation, a series of BDT models
is constructed, each time by dropping a different
part of the data from the training set and applying
the resulting model to the dropped data to predict
the target. The merged series of predictions for the
dropped (test) data is checked for accuracy
against the observations. In one version of the
cross-validation approach, called group
cross-validation, the data are divided into N
groups. A total of N models are then constructed
one by one, each using N-1 data groups for model
training and the remaining group for
testing. At the end of this procedure, all
predictions assembled from the dropped cases are
compared with the observed targets to compute
the cross-validation estimate of the model error.
Ten-fold cross-validation is used in this
study.
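
The group cross-validation procedure can be sketched as follows. This is an illustrative Python outline only: the function name is hypothetical, and the way observations are assigned to groups (every N-th index) is an assumption, since the text does not state how the groups were formed.

```python
def group_cross_validation_splits(n_samples: int, n_groups: int):
    """Return a list of (train_indices, test_indices) pairs.

    Assumption: observation i is assigned to group i % n_groups.
    Each group serves as the test set exactly once, and the other
    n_groups - 1 groups form the training set.
    """
    indices = list(range(n_samples))
    groups = [indices[g::n_groups] for g in range(n_groups)]
    splits = []
    for g, test in enumerate(groups):
        train = [i for h, grp in enumerate(groups) if h != g for i in grp]
        splits.append((train, test))
    return splits


if __name__ == "__main__":
    # Ten-fold example on 25 observations.
    for train, test in group_cross_validation_splits(25, 10):
        print(len(train), "training cases; test cases:", test)
```

Merging the test predictions from all N models gives exactly one prediction per observation, which is what is compared against the observed targets.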

Performance Measures 

A number of methods are available to evaluate
the performance of binary classifiers. For a classifier
with any given discrimination threshold, the number
of cases correctly and incorrectly classified can be
computed. This gives a confusion matrix with four
numbers as shown in Table 1. YY is the number of
true positives, i.e., how many cases are estimated by
the classifier as “Yes” events and actually are “Yes”
events. Similarly, NN is the number of
true negatives, YN is the number of false positives
(predicted “Yes”, observed “No”), and NY is the
number of false negatives (predicted “No”, observed
“Yes”). Using the statistics in Table 1, some frequently
adopted classifier performance evaluation methods
are described briefly below. More information about
these methods can be found, for example, in Refs.
[12-13].


Table 1. Confusion Matrix for Dichotomous
(“Yes”/“No”) Events

                               Actual Observation
                               Yes       No
Classifier Prediction   Yes    YY        YN
                        No     NY        NN


The Overall Accuracy Rate (OAR) is defined as
OAR = (YY + NN) / (YY + YN + NY + NN). It ranges
from 0 to 1, where 1 is the best classification performance
score. The probability of detection (POD), also called
the hit rate or recall, is the proportion of “Yes” observed
events that were correctly predicted, POD = YY /
(YY + NY). The probability of false alarm (PFA), also
called the false alarm ratio, is the proportion of “Yes”
predicted events that were actually observed as
“No” events, PFA = YN / (YY + YN). Its
values also range from 0 to 1. If YN = 0, then the
score goes to 0, the best one can expect. The Critical
Success Index (CSI) is the ratio of true positives to all
events that were either predicted or observed as “Yes”, CSI
= YY / (YY + YN + NY). Its values range from 0 to 1,
with a value of 1 indicating a perfect classification
performance score. The PFA can be controlled by
deliberately under-predicting the event; such a
strategy risks increasing the number of missed
events, which is not considered in the PFA. For this
reason, the POD and the PFA should both be
considered for a better understanding of the
performance of the forecast.

The OAR, POD, PFA, and CSI classifier
performance measures are used in this research.
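
As an illustration, the four measures can be computed directly from the entries of Table 1. The Python sketch below is not taken from the study; the function name and the example counts are hypothetical, and the cell convention is the one laid out in Table 1 (first letter is the prediction, second letter is the observation).

```python
def classifier_scores(yy: int, yn: int, ny: int, nn: int) -> dict:
    """Compute OAR, POD, PFA, and CSI from confusion-matrix counts.

    yy: predicted Yes, observed Yes    yn: predicted Yes, observed No
    ny: predicted No,  observed Yes    nn: predicted No,  observed No
    Note: a denominator of zero (e.g. no "Yes" predictions at all)
    raises ZeroDivisionError; handling that case is left to the caller.
    """
    total = yy + yn + ny + nn
    return {
        "OAR": (yy + nn) / total,
        "POD": yy / (yy + ny),
        "PFA": yn / (yy + yn),
        "CSI": yy / (yy + yn + ny),
    }


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    for name, value in classifier_scores(yy=40, yn=10, ny=10, nn=40).items():
        print(f"{name} = {value:.3f}")
```

Reporting POD and PFA together guards against the under-prediction strategy noted above, because lowering YN by predicting fewer events also raises NY and therefore lowers POD.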

3. DATA USED IN THE STUDY 

This section describes the FAA National Traffic
Management Log (NTML), the FAA Aviation
System Performance Metrics (ASPM), and the Rapid
Update Cycle (RUC) weather forecast analysis data.
The FAA NTML provides a single system for
automated coordination, logging, and communication
of Traffic Management Initiatives (TMIs), such as
GS and GDP events, throughout the National
Airspace System. The ASPM source provides airport
specific information such as arrival delays, scheduled
arrivals, and arrival demand for the major US airports.
The RUC was a National Oceanic and Atmospheric
Administration (NOAA) operational weather
prediction system which generated high-frequency
numerical weather forecasts until May 2012 [14]. All
data for the years 2007 through 2009 were derived
from these data sources.

GS and GDP Event Data 

Data on more than 8000 GS operations at the major
US airports were collected for the years 2007-2009
from the NTML database. The data were used for a
high-level statistical study of GS airport distributions
and causal factors.


Among these US airports, EWR airport had one
of the highest GS and GDP event rates over the years
2007-2009. During these three years, GSs and GDPs
were implemented at EWR on approximately 56% and
54% of the days, respectively. On these impacted
days, the actual durations were about 1.5 hours and 9
hours on average for GS and GDP, respectively.

The EWR GS and GDP data were collected for
each hour and for each day for the years 2007-2009.
The hourly or daily data were partitioned into four
sets based on whether GS and GDP operations
were carried out at EWR during a particular hour
or day. The four groups are labeled as
follows: GS/GDP when both a GS and a
GDP were carried out; GS/Non-GDP for a GS
only; GDP/Non-GS for a GDP implemented
without a GS; and Non-GS/Non-GDP when
neither was in effect for the hour or day investigated. Both
hourly and daily data were used in the GS statistical
studies. Only the daily data were used to generate and
test the classification model for predicting the GS
operations.
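
The four-way partition amounts to a simple labeling rule. The Python sketch below is illustrative only; the function names and the input record layout (a key plus two boolean flags) are assumptions rather than the data format actually used in the study.

```python
GROUPS = ("GS/GDP", "GS/Non-GDP", "GDP/Non-GS", "Non-GS/Non-GDP")


def tmi_label(gs_active: bool, gdp_active: bool) -> str:
    """Map the GS and GDP flags of one hour or day to its group label."""
    if gs_active and gdp_active:
        return "GS/GDP"
    if gs_active:
        return "GS/Non-GDP"
    if gdp_active:
        return "GDP/Non-GS"
    return "Non-GS/Non-GDP"


def partition(records):
    """Group (key, gs_active, gdp_active) records by their label."""
    groups = {name: [] for name in GROUPS}
    for key, gs_active, gdp_active in records:
        groups[tmi_label(gs_active, gdp_active)].append(key)
    return groups


if __name__ == "__main__":
    # Hypothetical daily records, for illustration only.
    days = [("day-1", True, False), ("day-2", True, True),
            ("day-3", False, True), ("day-4", False, False)]
    print(partition(days))
```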

ASPM Data 

Observed airport hourly delays, scheduled arrivals,
arrival demand, airport arrival rates (AAR), and
terminal weather data were collected from the ASPM
database. AAR is a dynamic parameter specifying the
number of arrival aircraft that an airport, in
conjunction with terminal airspace, can accept under
specific conditions throughout any consecutive hour.
Actual hourly airport surface weather observation
reports (METAR), including wind, ceiling, visibility,
and meteorological condition flags, are predominantly
used by air traffic controllers in air traffic
management and by meteorologists in weather
forecast modeling. ASPM data were preprocessed to
convert character records to numerical values, with
missing data being filtered out. The processed ASPM
data were used in the statistical analysis and also as
inputs for generating and validating the machine
learning GS models.

RUC Weather Data 

The RUC weather data were designed to provide
accurate numerical forecast guidance about severe
weather and hazards for aviation users for the next
several hours. RUC assimilates recent
weather observations aloft and at the surface to
provide hourly updates of current conditions and
short-range forecasts using a sophisticated mesoscale
model. The RUC model uses optimum interpolation
analyses and incorporates the surface analysis within
the 3-D analysis to produce 3-D grids which cover a
geographical domain over much of North America,
including the entire contiguous United States, with 40
vertical levels. The RUC grid used for the
modeling has 40-km horizontal resolution with 151 x
113 grid points on the surface.

RUC weather forecasts with 6-hour look-ahead
time periods over the years 2007 through 2009 were
collected from the NOAA servers. Each forecast has
151 x 113 grid points; there are 315 weather
parameters per grid point. The data were
preprocessed to select the grid point that is the closest
to EWR. Wind and storm moving speeds and
directions were calculated utilizing their RUC U and
V components. Only ten weather parameters were
chosen based on the EWR GS weather causal factors
(wind and thunderstorm) and the feature importance
analysis (see Section 2) using the TreeBagger [15].
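
For reference, speed and direction can be derived from U and V components as sketched below in Python. The text does not state which direction convention was used; the sketch assumes the usual meteorological convention (U positive toward the east, V positive toward the north, direction reported as the bearing the wind blows from, in degrees clockwise from north). The function name is hypothetical.

```python
import math


def speed_and_direction(u: float, v: float) -> tuple[float, float]:
    """Return (speed, direction_deg) from eastward (u) and northward (v)
    components. Assumed convention: direction is where the flow comes
    from, measured clockwise from north, in the range [0, 360)."""
    speed = math.hypot(u, v)
    direction = math.degrees(math.atan2(-u, -v)) % 360.0
    return speed, direction


if __name__ == "__main__":
    # Flow toward the south (u = 0, v < 0) is a northerly wind.
    print(speed_and_direction(0.0, -10.0))
    # Flow toward the west (u < 0, v = 0) is an easterly wind.
    print(speed_and_direction(-10.0, 0.0))
```

For a storm motion vector one would typically report the direction of travel instead, which differs from the value above by 180 degrees.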

Table 2 lists the 10 RUC surface weather
parameters and the numbers associated with them.
These selected parameters carry very important
weather information for air traffic control. These
variables can be categorized as follows: pressure
(#1), wind and max wind (#2 to #5), visibility (#6),
storm (#7 to #8), and lifted indexes (#9 to #10), which
offer energy information on the intensities of severe
weather.


Table 2. RUC Forecast Parameters

 #   RUC Forecast Parameter
 1   Surface Pressure Tendency (PTEND) [Pa/s]
 2   10 m above ground Wind Speed (WSGRD) [m/s]
 3   Max Wind Pressure (MWPRES) [Pa]
 4   Max Wind Speed (MWS) [m/s]
 5   Surface Gust Wind Speed (GUST) [m/s]
 6   Surface Visibility (VIS) [m]
 7   Surface Storm Relative Helicity (HLCY) [m^2/s^2]
 8   Surface Storm Motion Speed (SSMS) [m/s]
 9   Surface Lifted Index (LFTX) [K]
10   Surface Best Lifted Index to 500 mb (BLI) [K]


4. STATISTICS OF GROUND STOPS 

More than 8000 GS events for all US airports
from the year 2007 through 2009 were collected
from NTML. These data were used to generate the
distributions reflecting the GS activities at the US
airports. Section 4.1 describes the activity levels as
well as the underlying factors that normally drive
the events. In the remainder of this section, the
historical GS analysis at EWR airport is presented
in terms of the time series distribution, demand
and delay analysis, and the usage in conjunction
with GDPs.

4.1 GS Analysis of the U.S. Airports 

A distribution of the GSs at U.S. airports over
the years 2007-2009 is given in Fig. 1. The top six
impacted airports were Newark Liberty
International Airport (EWR), LaGuardia Airport
(LGA), Atlanta International Airport (ATL),
Chicago O’Hare International Airport (ORD),
Philadelphia International Airport (PHL), and John
F. Kennedy International Airport (JFK). They
accounted for 13%, 9%, 8%, 7%, 7%, and 6% of all
GSs, respectively, and the other airports (with less
than 4% each) took up the remaining 50% of
the operations.

Figure 1. GS Distribution at the U.S. Airports

The causal factors, as recorded in the NTML
database, are shown in Fig. 2. As can be seen from
this plot, "Weather" was the predominant stated
cause (80%) for the GSs at all airports. Among the
other "non-weather" causal factors, the presence
of "Volume" related GSs at these airports was also
noteworthy, since they accounted for more than
12% of all GSs. In this figure, "Volume" is used to
indicate the air traffic congestion at the arrival
airports.

Figure 2. Causal Factors for GSs at the U.S.
Airports [pie chart; legible label: Runway/Equipment 5%]

The diverse weather causes at the sub-category
level for the U.S. airports through the
years 2007-2009 are shown in Fig. 3. It
demonstrates that the most serious weather
component for GS operations was
"Thunderstorms", which accounted for 46% of
weather impacted GSs.

Figure 3. Weather Subcategory Causal Factors
for GSs at the U.S. Airports


The diverse weather causes for each of the 
top-six U.S. airports over the years 2007-2009 are 
shown in Fig. 4. The weather causal factors were 
different for different airports. For GSs at EWR 
airport, the top three causal factors were "Wind" 
with 41%, "Thunderstorms" with 26%, and "Low 
Ceilings/Fog" with 20% of the total number of GSs 
caused by weather. 



Figure 4. Weather Causal Factors for GSs at the
Top 6 U.S. Airports

The GS start time and planned stop time were
issued by TFM when a GS was implemented. The
GS planned duration is defined as the difference
between the GS planned stop time and the GS start
time. During a GS, these program parameters
might need to be revised because of changing
weather or operating conditions. GS revisions may
lead to further changes of the GS stop time, and the
actual duration is the time between the
actual GS stop time and the GS start time. Table 3
shows the average planned and actual
durations for the GSs caused by weather and
non-weather factors at the top six airports. The average GS
durations were all around one hour. For those GSs
caused by weather at the six airports, the
average actual durations were up to about 30
minutes longer than originally planned. The
differences between the average actual and
planned durations for those GSs caused by
runway/equipment, volume, or other
non-weather reasons were relatively small, around a
few minutes.

Table 3. GS Durations (hours:minutes) for the Top 6
Airports

            Weather                   Non-Weather
Airport     Average     Average      Average     Average
            Planned     Actual       Planned     Actual
            Duration    Duration     Duration    Duration
EWR         1:14        1:30         1:07        1:08
LGA         1:12        1:37         1:04        1:01
ATL         1:06        1:11         1:05        0:54
ORD         1:09        1:20         1:01        1:15
PHL         1:11        1:24         1:04        0:58
JFK         1:18        1:49         1:09        1:10

The remainder of this paper focuses on the
study of those GSs implemented at EWR airport,
where the highest GS incidence of 13% took
place, as shown in Fig. 1.

4.2 EWR GS Statistics

Temporal usage statistics (e.g., monthly, daily
and hourly) for GS operations at EWR are
exhibited in Fig. 5.



Figure 5. Temporal Usage Statistics for GSs at
EWR Airport [panels from top to bottom: by Month,
by Weekday, and by GS Start Local Time (Hour)]

The data were divided in terms of weather
(blue bars) and non-weather (red bars) events.
Starting with the monthly usage statistics, which
appear in the upper-most image in Fig. 5, it is
noted that there tend to be more weather-related
GS operations in the summer months (June
through August), while "Non-weather" related GS
operations are almost flat, with no consistent pattern of
monthly peaking. In terms of the weekly usage of
GS operations at EWR (see the middle image in
Fig. 5), the number of operations was fairly
constant with a noticeable decrease in the usage
on Saturdays, which was to be expected since the
arrival demand also tended to be lower on
Saturdays. Finally, the hourly patterns of the
profiles (see the bottom image in Fig. 5) are fairly
apparent, i.e., the GS operations tended to peak
between 10:00 am and 8:00 pm local time (Eastern
Daylight Time, EDT), which coincided with the
hours of higher arrival demand at the airport.

Using the TMI report time as the TMI issue
time, the time difference between the TMI
implemented start time and the TMI issue time
may indicate how well the TMI action is planned.
The time differences for the EWR GS events and
GDP events without GS interactions from the year
2007 through 2009 are shown in Fig. 6 (a) and (b),
respectively. The fact that the GS at EWR
frequently started at the issue time (see Fig. 6 a)
suggests that in general the GS was a reactive
response when a sudden and unexpected
imbalance of airport demand and capacity
occurred. In contrast to the GS, the EWR GDP issue
time was earlier than the GDP start time by two hours
on average (see Fig. 6 b).



(a) GS Start Time - Issue Time (Hour) 



Figure 6. The Time Difference between Start Time 
and Issue Time for GSs (a) and GDPs (b) at EWR 
Airport 

The GS planned duration and actual duration
versus the GS start time for all EWR GS operations
from 2007 through 2009 are shown in Fig. 7. The
time distributions of GS planned and actual
durations are listed in Table 4. As expected, the GS
planned duration was relatively short; it was less
than 2 hours 98% of the time (see Table 4) and was not
influenced by the start time (Fig. 7 a). However,
the actual durations were often extended and
occasionally (4% of the time, see Table 4)
lasted for 3 to 6 hours (Fig. 7 b).
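
The duration bins of Table 4 can be reproduced from start and stop times along the following lines. This Python sketch is illustrative only: the function names and the timestamp format are assumptions, and the example interval is one of the 6/18/2008 GSs quoted in Section 4.4.

```python
from datetime import datetime

BINS = ("<1 h", ">=1 & <2 h", ">=2 & <3 h", ">=3 h")


def duration_hours(start: str, stop: str, fmt: str = "%Y-%m-%d %H:%M") -> float:
    """Duration in hours between two timestamps (assumed format)."""
    delta = datetime.strptime(stop, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 3600.0


def duration_bin(hours: float) -> str:
    """Assign a duration to one of the four bins used in Table 4."""
    if hours < 1:
        return BINS[0]
    if hours < 2:
        return BINS[1]
    if hours < 3:
        return BINS[2]
    return BINS[3]


def bin_percentages(durations) -> dict:
    """Percentage of durations falling in each bin."""
    durations = list(durations)
    if not durations:
        raise ValueError("at least one duration is required")
    counts = {name: 0 for name in BINS}
    for hours in durations:
        counts[duration_bin(hours)] += 1
    return {name: 100.0 * c / len(durations) for name, c in counts.items()}


if __name__ == "__main__":
    hours = duration_hours("2008-06-18 15:48", "2008-06-18 19:30")
    print(f"{hours:.2f} h falls in bin {duration_bin(hours)}")
```

Applying `bin_percentages` once to the planned durations and once to the actual durations would give the two rows of Table 4.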



(a) GS Start EDT (Hour) (b) GS Start EDT (Hour) 

Figure 7. EWR GS Planned (a) and Actual (b) 
Duration versus GS Start Time 


Table 4. EWR Planned and Actual GS Time
Distributed Percentages

GS Counts           < 1      >= 1 &      >= 2 &      >= 3
Percentage          Hour     < 2 Hour    < 3 Hour    Hour
Planned Duration    19%      79%         2%          0%
Actual Duration     27%      61%         8%          4%


4.3 GS Variations with Demands and Delays 

Conceptually, GS or GDP operations are used
during the hours with an imbalance between arrival
demand and airport capacity. Such an imbalance may
lead to higher delays for the airport arrivals. To test this, the
EWR demand and delay data from 2007-2009
were partitioned into four sets based on whether
a GS and a GDP were in operation during the
hour.

The EWR hourly GS and GDP count
percentages from 5 am local time to midnight over
2007-2009 are listed in Table 5. During this time
period, hours with neither GS nor GDP
accounted for 68% of the time, followed by
GDP only operations at 20%; GS with GDP and GS
without GDP each occupied only a small portion,
i.e., 6% of the time.


Table 5. EWR Hourly GS and GDP Count
Percentages

Hour Counts    GS/GDP    GS/Non-GDP    GDP/Non-GS    Non-GS/Non-GDP
Percentage     6%        6%            20%           68%


The ratios of EWR arrival demand to the
airport capacity, AAR, are presented in Figure 8,
where histogram (a) presents the hourly ratio
counts for the hours with both GS and GDP
in operation (GDP/GS), (b) and (c) for the hours with
GDP only (GDP/Non-GS) and GS only (GS/Non-GDP)
events, respectively, and (d) for the hours without GDP
or GS operations (Non-GDP/Non-GS). The ratio
is greater than one when the arrival
demand exceeds the airport capacity. Fig. 8
reveals that the ratio of EWR demand to AAR
was much larger than 1 during GDP operation
hours and just above 1 during GS implemented hours,
while, as expected, the ratio for the hours without
any GDP or GS peaked at less than 1.
The fact that the ratio for the GDP hours is larger
than that for the GS hours suggests that the GSs
were mostly required for resolving relatively
small imbalances, while GDPs were used to
handle the larger imbalances.



Figure 8. The Ratio of Hourly Demand and AAR 
for EWR GS and GDP Events 


The hourly scheduled arrival delays in
minutes are presented in Figure 9, where
histogram (a) is for GDP/GS hours, (b) and (c) for
GDP/Non-GS and GS/Non-GDP hours, and (d) for
Non-GDP/Non-GS hours. Fig. 9 shows that, as
anticipated, the arrival delays during GDP hours
were greater than those during GS only hours, and
naturally, the corresponding delays without GDP
and GS were the smallest of all.

Figure 9. Effect of GSs and GDPs on EWR Hourly
Schedule Arrival Delays [panels (a) GDP/GS Hour
through (d) Non-GDP/Non-GS Hour; x axis: Delay Minutes]

The airborne delay minutes are presented in
Figure 10, where histogram (a) is for GDP/GS
hours, (b) and (c) for GDP/Non-GS and GS/Non-GDP
hours, and (d) for Non-GDP/Non-GS hours. Fig. 10
reveals that the airborne delays during GS
implemented hours were greater than the delays
during GDP hours, and that the airborne delays for
non-GDP/non-GS hours were similar to those for
GDP/non-GS hours. The fact that the GS involved
airborne delays were longer than those for the other
cases signifies that the implementation of the GS
was affected by the airborne delays and was used
to preclude extended periods of airborne holding
for the arrivals destined to the airport.

Figure 10. Influence of GSs and GDPs on EWR
Hourly Airborne Delays [panels (a) GDP/GS Hour
through (d) Non-GDP/Non-GS Hour; x axis: Delay Minutes]


4.4 GS and GDP Interactions 

The EWR GDP time durations were about
nine hours on average [5], so only one GDP could
be implemented per day (from 2 AM local time to
the next 2 AM) for the years 2007-2009. The GSs were
much shorter; sometimes multiple GSs could be
enacted on the same day. The EWR daily GS/GDP
implementation percentages for the years 2007-2009,
with 1096 days in total, are listed in Table 6.
It shows that GSs were enacted on 56% of the days;
both a GS and a GDP were implemented on 35% of
the days; neither was required on 25% of the days;
and GS only and GDP only operations took place on
21% and 19% of the days, respectively.


Table 6. EWR Daily GS and GDP Percentages

Days          GS/GDP    GS/Non-GDP    GDP/Non-GS    Non-GS/Non-GDP
Percentage    35%       21%           19%           25%


The EWR daily GS count percentages on those
days with GDP (35% in Table 6) and without GDP
(21% in Table 6) over 2007-2009 are listed in
Table 7. It shows that on GS/GDP and GS/Non-GDP
days, multiple GS incidents occurred 42% and 48%
of the time,
respectively. Meanwhile, more than three GSs
were operated 3% of the time regardless of
whether a GDP happened or not. Counting all
multiple GS cases together, they were carried out
on 25% of the days (35%*42%+21%*48%).
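
The 25% figure follows from weighting the multiple-GS shares in Table 7 by the day-type shares in Table 6. A short Python check is given below; the variable and function names are hypothetical, and the numbers are the rounded percentages printed in the tables.

```python
# Share of all days by day type (Table 6).
DAY_SHARE = {"GS/GDP": 0.35, "GS/Non-GDP": 0.21}

# Distribution of GS counts per day within each day type (Table 7).
GS_PER_DAY = {
    "GS/GDP":     {"1": 0.58, "2": 0.29, "3": 0.10, ">3": 0.03},
    "GS/Non-GDP": {"1": 0.52, "2": 0.31, "3": 0.14, ">3": 0.03},
}


def multiple_gs_day_share() -> float:
    """Share of all days on which two or more GSs were enacted."""
    total = 0.0
    for day_type, share in DAY_SHARE.items():
        multiple = sum(p for count, p in GS_PER_DAY[day_type].items()
                       if count != "1")
        total += share * multiple
    return total


if __name__ == "__main__":
    print(f"{100 * multiple_gs_day_share():.1f}% of days had multiple GSs")
```

Because the inputs are rounded table values, the result agrees with the quoted 25% only to the nearest percent.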


Table 7. EWR GS Counts/Day Percentages

GS Counts/Day     1      2      3      >3
GDP Day (35%)     58%    29%    10%    3%
Non-GDP (21%)     52%    31%    14%    3%


Four typical GS implemented days during the
summer of 2008 are shown in Fig. 11. The time
along the x axis shown in the figure ranges from 2
AM to the next 2 AM EDT. The red lines in the figure
represent the GS events and the blue lines indicate
the GDP events. The top two plots in Fig. 11 depict
multiple GS activities on 8/18/2008 and
8/19/2008, when no GDP occurred. On
8/18/2008, the first GS started (red line jumped
from 0 to 1) at 13:34 local time (EDT) and ended
at 14:54 (dropped from 1 to 0); the second one
started at 16:05 and ended at 17:09 (see the top
image in Fig. 11). On 8/19/2008, three GSs
(13:55-14:39, 15:20-16:55, and 17:50-19:10)
were implemented (see the second
plot from the top in Fig. 11).

The bottom two diagrams show the events
that happened on two ordinary GS/GDP days, one
on 6/18/2008 and the other on 7/17/2008. There
were two GSs implemented on 6/18/2008 and
three GSs on 7/17/2008. On 6/18/2008, the GDP
started at 12:33 and
ended at 00:38 on the next day. The two GSs
(15:48-19:30 and 21:01-22:30) were enforced
during GDP hours. On 7/17/2008,
the GDP started at 19:30 and continued until
00:59 the next day. The three GSs (12:09-12:54,
15:09-16:19, and 17:30-19:45) were started
before the GDP.

Figure 11. Four Examples of the Multiple GS
Implemented Days during the Summer of the Year
2008

If multiple GSs occur very close together, they
can increase the uncertainty in the
operations of the affected aircraft. To study the
impact of multiple GSs, two variables are
introduced in order to characterize the closeness
of the GSs. The first one is the sum duration for
multiple GSs, defined as the sum of the GS durations.
The second is the distributed duration, defined as
the difference between the end time of the latest
GS and the start time of the earliest GS. If the sum
duration is close to the distributed duration,
the multiple GSs are not far apart. The multiple
GS distributed duration vs. the sum duration for
GS/Non-GDP and GS/GDP days are shown in Fig. 12 (a)
and (b), respectively. The distributed durations are clustered
closely on GS/Non-GDP days, whereas the plot is
more dispersed on GS/GDP days.
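
The two variables can be illustrated with the 8/19/2008 example from Fig. 11. The Python sketch below is illustrative only; the function names are hypothetical, and it assumes all GSs start and end on the same calendar day so that times can be compared as minutes after midnight.

```python
def to_minutes(hhmm: str) -> int:
    """Convert "HH:MM" to minutes after midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def sum_duration(gs_list) -> int:
    """Sum of the individual GS durations, in minutes."""
    return sum(to_minutes(end) - to_minutes(start) for start, end in gs_list)


def distributed_duration(gs_list) -> int:
    """Time from the earliest GS start to the latest GS end, in minutes."""
    earliest = min(to_minutes(start) for start, _ in gs_list)
    latest = max(to_minutes(end) for _, end in gs_list)
    return latest - earliest


if __name__ == "__main__":
    # The three GSs at EWR on 8/19/2008, as given in the text.
    gs_0819 = [("13:55", "14:39"), ("15:20", "16:55"), ("17:50", "19:10")]
    print("sum duration (min):        ", sum_duration(gs_0819))
    print("distributed duration (min):", distributed_duration(gs_0819))
```

The gap between the two values is the total time between consecutive GSs; a small gap means the GSs were packed closely together.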



Figure 12. EWR GS Distributed Duration vs. Sum 
Duration for GS/Non-GDP Days (a) and GS/GDP 
Days (b) 

GSs enacted before or during GDP events
can have some influence on the GDP planned
variables, such as the GDP issue time, start time,
and planned duration. Figure 13 shows
the EWR GDP planned duration vs. GDP start time
for GS days (a) and non-GS days (b) for the years
2007-2009. The figure reveals that the GDP could
start at any time on the GS/GDP days, whereas
on the GDP/Non-GS days the GDPs were enacted no
later than 2:30 pm local time. On the
GS/GDP days, the GDPs starting after 2:30 pm
accounted for 17% of the cases.
Figure 13. EWR GDP Planned Duration vs. GDP 
Start Time for the GS/GDP Days (a) and for the 
GDP/Non-GS Days (b) 

Figure 14 displays the time difference 
between the GS start time and the GDP start time 
for GDPs that started after 2:30 pm local time (a) 
and at or before 2:30 pm (b) on the GS/GDP days at 
EWR during the years 2007-2009. Fig. 14 (a) shows 
that, on the GS/GDP days when the GDP started 
after 2:30 pm, all GSs started earlier and were 
then transformed into a GDP (for example, see 
7/17/2008 in Fig. 11). This happened on 6% of 
the days investigated (35%*17%). In cases where 
the GDP started at or before 2:30 pm (see Fig. 14 
(b)), the GSs took place at least half an hour 
earlier than the GDP roughly 25% of the time. This 
occurred on about 7% of the days studied 
(35%*83%*25%). 
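
The two percentages quoted above follow from multiplying the reported shares. The short check below only reproduces that arithmetic; the variable names are illustrative and not from the study.

```python
# Shares quoted in the text for EWR, 2007-2009.
gs_gdp_day_share = 0.35  # days with both GSs and a GDP
late_gdp_share = 0.17    # of those days, GDP started after 2:30 pm
early_gs_share = 0.25    # of the remaining days, GS led the GDP by >= 30 min

# Days on which a GS preceded a GDP that started after 2:30 pm.
late_transition = gs_gdp_day_share * late_gdp_share
# Days on which a GS preceded an earlier-starting GDP by at least half an hour.
early_transition = gs_gdp_day_share * (1 - late_gdp_share) * early_gs_share
total_transition = late_transition + early_transition

print(f"{late_transition:.1%} {early_transition:.1%} {total_transition:.1%}")
```

The two products round to 6% and 7%, and their sum rounds to 13% of the days studied.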




Figure 14. The Difference between GS Start Time 
and GDP Start Time (GS Start Time - GDP Start 
Time, in Hours) for EWR GS/GDP Days with GDP 
Start Time after 2:30 pm (a) and at or before 
2:30 pm (b) 

Figure 15 shows the time differences 
between the GDP start and issue times on the 
GS/GDP days with the GDP starting (a) after 2:30 
pm local time and (b) at or before 2:30 pm at EWR 
during the years 2007-2009. The average 
difference of two hours between the GDP issue 
time and the GDP start time on GDP/Non-GS 
days (see Fig. 6 (b)) indicates that the GDP events 
were well planned when no GS occurred. In 
contrast, on GS/GDP days the GDP issue time was 
not much earlier than the GDP start time, 
especially for the case shown in Fig. 15 (a). The 
noticeable peaks at zero in Fig. 15 (a) and (b) 
suggest that the GDPs were started at their issue 
time when TFM made the transition from a GS into 
a GDP. 
Figure 15. The Time Difference between GDP 
Start Time and Issue Time for GS/GDP Days with 
GDP Start Time after 2:30 pm (a) and at or before 
2:30 pm (b) 

In summary, the following observations of the 
EWR GSs over the years 2007-2009 were made from 
the statistical analysis presented in this section: 

(a) GSs were enacted reactively to an 
unexpected imbalance of airport demand and 
capacity and used to preclude extended 
airborne holdings. 

(b) 12% of the actual GS durations were 
longer than 2 hours and 4% of them were 
between 3 and 6 hours. 

(c) Multiple GSs were enacted on 25% of the 
days. 

(d) 35% of the days in the three years had GSs 
and a GDP implemented on the same day. On 
about 13% of the days studied, TFM made a TMI 
transition from a GS into a GDP event. 

These observations demonstrate that the GS 
is an important TFM action for reducing the 
imbalance between airport demand and capacity. 
However, the actual GS durations frequently 
extended beyond the planned ones, multiple GSs 
were often necessary, and some GSs were 
occasionally transformed into GDPs, all of which 
made GS operations at EWR difficult to predict. 
To better estimate and manage the need for GSs, 
the remainder of this study presents the training 
and validation of BDT models that attempt to 
forecast GS operations based on past experience. 
The methods may help TFM specialists identify 
better operations for controlling the air traffic 
destined for the constrained EWR airport. 

5. CLASSIFICATION RESULTS 

This section contains the classification results 
obtained using the Ensemble Bagging Decision Tree 
models to (1) predict the usage of GS operations on 
the Non-GDP days, to (2) forecast the GDP usage on 
the Non-GS days, to (3) distinguish the same day 
usage of both GS and GDP operations from the 
normal days (Non-GS/Non-GDP), and to (4) assess 
the usage of GS operations on the GDP days. In all 
four cases, supervised machine learning was used to 
train the BDT binary classification models, and the 
model validation was accomplished with ten-fold 
cross validation. 
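
The study does not describe how the ten folds were formed. The sketch below shows one common way to build them with the standard library only. The shuffling, the fixed seed, and the function name are assumptions made for illustration, and the sample count of 387 is the number of non-GDP days reported for the first case.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation.

    The indices are shuffled once (seed is an assumption) and dealt into
    k near-equal folds; each fold serves as the test set exactly once.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx

# Ten folds over 387 days: each day is tested once and trained on nine times.
splits = list(k_fold_indices(387, k=10))
fold_sizes = sorted(len(test) for _, test in splits)
```

In such a scheme a classifier would be trained on each `train_idx` set and scored on the matching `test_idx` set, with the per-fold predictions pooled into one contingency table.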

In this analysis, the “prediction start time” is 
taken as the hour one hour earlier than the start time 
of the GS or GDP, whichever came first. For example, 
12:00 pm was used as the prediction start hour on 
8/18/2008 (see Fig. 11; the earliest GS began at 
13:34). On the normal days (Non-GS/Non-GDP), 
11:00 am EDT, just before the start of heavy air 
traffic at EWR, was used as the prediction start hour. 
The BDT models were trained and tested using the 
ASPM EWR airport conditions, ASPM EWR 
terminal METAR weather data, the 6-hour look-ahead 
EWR scheduled arrivals, and the EWR 6-hour RUC 
forecast data at the prediction start hour as inputs. 
Note that, whereas the GS issue time was on average 
the same as the GS start time, the prediction start 
time is always at least an hour earlier than the GS 
start. 
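
The rule for the prediction start hour can be sketched as follows. The function name is hypothetical, and the rounding down to the top of the hour is inferred from the 8/18/2008 example rather than stated explicitly in the study.

```python
from datetime import datetime, timedelta

def prediction_start_hour(event_starts):
    """Top of the hour at least one hour before the earliest GS/GDP start.

    event_starts: start times of all GS and GDP events on one day.
    Rounding down to the hour is an assumption based on the text's example.
    """
    earliest = min(event_starts)
    shifted = earliest - timedelta(hours=1)
    return shifted.replace(minute=0, second=0, microsecond=0)

# 8/18/2008: the earliest GS began at 13:34.
aug18 = prediction_start_hour([datetime(2008, 8, 18, 13, 34)])

# 7/17/2008: first GS at 12:09, GDP at 19:30; the GS came first.
jul17 = prediction_start_hour(
    [datetime(2008, 7, 17, 12, 9), datetime(2008, 7, 17, 19, 30)]
)
```

The first call gives 12:00 on 8/18/2008, matching the example in the text. By the same rule, the second gives 11:00 on 7/17/2008.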

5.1 Prediction of GS days 

The ability to predict the days that require 
GSs may help TFM prepare for such situations. To 
estimate whether GSs were required on a given 
non-GDP day, the non-GDP days were grouped into 
two classes labeled “Yes” and “No”. The “Yes” 
class contains the days on which at least one GS 
was required, while the “No” class contains the 
days without any GS or GDP events. Using the 
binary indicator of GS usage as the target, the BDT 
classification models were first developed and 
trained, and subsequently applied to the test data 
for prediction purposes. 

The prediction result at the prediction start hour 
for EWR airport is shown in Table 8. Out of the 387 
non-GDP days, 167 days had at least one GS enacted. 
The prediction accuracy of the BDT binary classifier, 
given by OAR, is the proportion of correct 
results, (123+206)/387 = 0.85. Out of a total of 167 
observed GS days, 123 were correctly predicted, so 
the precision is 123/167 = 0.74 (see POD in Table 8). 
Out of a total of 137 predicted GS days, 14 were 
falsely predicted, so the false alarm ratio is 
14/137 = 0.10 (see PFA in Table 8). Out of a total of 
181 (123+14+44) observed or predicted GS days, 
123 were correctly predicted, so the Critical 
Success Index (CSI) is 123/181 = 0.68. 
Overall, compared against the observation data, the 
BDT model appears to perform well in predicting the 
required GS operations. A review of the GSs 
implemented under these conditions may help to 
improve the predictability of the GS operations. 


Table 8. Prediction of the EWR GS Days 

EWR GS Day Predictions      Actual Observation 
                            Yes     No      Sum 
BDT Prediction   Yes        123     14      137 
                 No          44    206      250 
                 Sum        167    220      387 

OAR: 85%, POD: 74%, PFA: 10%, CSI: 0.68 
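
The four scores can be recomputed from the 2x2 contingency table. The sketch below follows the ratios worked out in the text for Table 8. The function and argument names are illustrative and not from the study.

```python
def verification_scores(hits, false_alarms, misses, correct_negatives):
    """Scores as computed in the text from a 2x2 contingency table.

    hits: predicted Yes, observed Yes; false_alarms: predicted Yes,
    observed No; misses: predicted No, observed Yes;
    correct_negatives: predicted No, observed No.
    """
    total = hits + false_alarms + misses + correct_negatives
    return {
        "OAR": (hits + correct_negatives) / total,     # overall accuracy
        "POD": hits / (hits + misses),                 # share of observed Yes days found
        "PFA": false_alarms / (hits + false_alarms),   # share of predicted Yes days that were wrong
        "CSI": hits / (hits + false_alarms + misses),  # critical success index
    }

# Table 8: prediction of the EWR GS days.
table8 = verification_scores(hits=123, false_alarms=14, misses=44,
                             correct_negatives=206)
rounded = {name: round(value, 2) for name, value in table8.items()}
```

Rounded to two decimals, this gives OAR 0.85, POD 0.74, PFA 0.10, and CSI 0.68, in agreement with Table 8.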


5.2 Prediction of GDP Days 

In parallel with the prediction of the GS days, 
the prediction of GDP operations on non-GS 
days was also performed using BDT models. To 
determine whether a GDP was required on a given 
non-GS day, the non-GS days were grouped into 
two classes labeled “Yes” and “No”. The “Yes” 
class indicates that a GDP was required on a 
particular day, while the “No” class indicates 
that no GDP was required on that day. 

The prediction of whether a GDP is required at 
the prediction start hour for the EWR airport is 
shown in Table 9. Out of the 367 non-GS days, 147 
days had a GDP implemented. The accuracy of the 
BDT model prediction, OAR, is 0.86. The precision 
(POD) is 0.80, the false alarm ratio (PFA) is 0.15, 
and the CSI is 0.70. The BDT model performs at 
least as well at identifying GDP days as the BDT 
model for the prediction of GS days. 


Table 9. Prediction of the EWR GDP Days 

EWR GDP Day Predictions     Actual Observation 
                            Yes     No      Sum 
BDT Prediction   Yes        117     21      138 
                 No          30    199      229 
                 Sum        147    220      367 

OAR: 86%, POD: 80%, PFA: 15%, CSI: 0.70 


5.3 Prediction of GS and GDP days 

To distinguish the GS and GDP days from 
the normal (Non-GDP/Non-GS) days, the data 
were grouped into two classes in the same way as 
before: the “Yes” class indicates that both a GS 
and a GDP were required on a particular day, while 
the “No” class indicates that neither a GDP nor a 
GS was required on that day. The results are 
shown in Table 10. The accuracy of the BDT 
classifier, OAR, is 0.85. The precision (POD) and 
false alarm ratio (PFA) are 0.88 and 0.15, 
respectively. The Critical Success Index (CSI) is 
0.76. 


Table 10. Prediction of EWR GS and GDP Days 

EWR GS/GDP Day Predictions  Actual Observation 
                            Yes     No      Sum 
BDT Prediction   Yes        246     43      289 
                 No          33    177      210 
                 Sum        279    220      499 

OAR: 85%, POD: 88%, PFA: 15%, CSI: 0.76 


5.4 Prediction of GSs on GDP days 

The ability to predict the GDP days that also 
require GS operations may help TFM specialists 
adjust the GDP parameters (such as the start time, 
affected flights, etc.) to increase the 
predictability of TFM actions. This is a difficult 
problem because the weather situations leading to 
a GDP alone or to both a GDP and a GS were similar. 
As before, each GDP day was labeled “Yes” if it 
had at least one GS operation, or “No” otherwise. 
The classification results are shown in Table 11, 
with OAR=71%, POD=86%, PFA=26%, and CSI=0.66. 


Table 11. Prediction of GSs Implemented on GDP 
Days 

EWR GS Predictions          Actual Observation 
for the GDP Days            Yes     No      Sum 
BDT Prediction   Yes        239     82      321 
                 No          40     65      105 
                 Sum        279    147      426 

OAR: 71%, POD: 86%, PFA: 26%, CSI: 0.66 


The BDT AAR model predictions using the 6-hour 
look-ahead RUC forecast performed reasonably well 
in this GS and/or GDP day prediction study. In 
distinguishing GS, GDP, or GS and GDP days from 
normal days, the overall prediction accuracies are 
about 85%, with precisions ranging from 74% to 88% 
and false alarm ratios from 10% to 15%. In 
discriminating GS and GDP days from GDP-only days, 
the overall prediction accuracy, the precision, and 
the false alarm ratio are 71%, 86%, and 26%, 
respectively. 


learning is employed to train the BDT binary 
classification models. The models are validated 
using data cross validation methods. When 
predicting the occurrence of GS, GDP, and GS/GDP 
days against the normal days, the model achieved 
an overall accuracy rate of about 85%. In 
distinguishing the GS/GDP days from GDP/Non-GS 
days, an overall accuracy rate of 71% was achieved. 

In summary, the BDT model predictions proposed 
here provide an approach to understanding and 
accounting for the uncertainty in demand and 
weather-impacted capacity, and to learning from 
past experience. The study provides information that 
may be useful in improving FAA TFM daily 
operations. 